
A run context manager #484

Closed

Conversation

@christian-monch (Contributor) commented Oct 13, 2023

This PR adds a run-context manager. It is currently a draft because there are open TODOs, but I wanted to start the discussion early; any input on this PR is appreciated.

What's in it

The run-context manager starts a subprocess and returns either a result structure or a generator to the code executed in the context.

If the protocol that is used to create the context does not inherit `GeneratorProtocolMixIn`, the context manager will execute the specified subprocess and, once the subprocess exits, return the result of `protocol._prepare_result()` in the `as`-variable.

If the protocol that is used to create the context does inherit `GeneratorProtocolMixIn`, the context manager will start the execution of the process and return the resulting generator in the `as`-variable. The generator can be used to fetch output from the subprocess and to trigger protocol callbacks, e.g. `pipe_data_received`, `timeout`, etc. When the code exits the context, the context manager will execute `tuple()` on the generator to ensure that the subprocess exit is properly handled, i.e. protocol callbacks are called and the return code of the subprocess is fetched.

The context manager supports the specification of terminate- and/or kill-times. If those are specified, the context manager might send a terminate- and/or a kill-signal to the subprocess. The following happens if a terminate- and/or kill-time is specified (a usage sketch follows the list):

  • In the non-generator case, the context manager will send a terminate-signal to the process if it does not exit within the terminate-time number of timeouts (if no timeout was specified by the caller, the context manager uses 1 second). It will send a kill-signal if the process does not exit within (terminate-time + kill-time) number of timeouts.
  • In the generator case, the context manager will start the terminate/kill logic after the code has left the context. The context manager will try to exhaust the generator and will send a terminate- or kill-signal to the process if the terminate or kill condition is reached before the generator is exhausted.
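As a usage sketch (parameter names are taken from this PR's diff snippets; import paths and defaults are assumptions, not a verbatim API reference):

```python
# Sketch only: import paths are assumed from this PR's file layout
# (datalad_next/runners/run.py and datalad_next/runners/__init__.py).
from subprocess import DEVNULL

from datalad_next.runners import StdOutCaptureGeneratorProtocol
from datalad_next.runners.run import run

with run(
    cmd=['git', 'ls-files', '-z'],
    protocol_class=StdOutCaptureGeneratorProtocol,
    stdin=DEVNULL,
    timeout=1.0,        # length of one timeout interval in seconds
    terminate_time=5,   # terminate-signal after 5 timeout intervals
    kill_time=3,        # kill-signal 3 intervals after the terminate-time
) as r:
    for chunk in r:     # generator protocol: `r` yields output chunks
        ...
# on context exit, the remaining generator items are consumed via tuple()
# and, if needed, terminate-/kill-signals are sent as described above
```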

The PR comes with a number of tests that demonstrate the behavior.

Data processors

The PR includes an implementation of a generator that can be configured to read data from a base generator, perform different steps of data processing, and yield the result of the last processing step to the caller. There are currently three data processors:

  • A "decode" processor that decodes bytes into strings
  • A "linesplit" processor that splits bytes or strings at line-ends or at the file-end
  • A "jsonline" processor that converts lines into JSON-objects if possible.

Which processors are used is defined by putting them into a list. For example, the list [decode_processor, splitlines_processor] specifies that the content received from the underlying generator should first be decoded and then split into lines. The processing is "stream"-oriented, which means that data is returned as soon as possible, e.g. each individual line is returned to the caller as soon as it is read.
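For illustration, a minimal sketch of consuming such a pipeline, modeled on the `process_from()` call that appears later in this PR's diff (the import paths are inferred from the coverage report and are assumptions; the byte-chunk source stands in for a run-context generator):

```python
# Import paths assumed from this PR's file listing; not verbatim API docs.
from datalad_next.runners.data_processor_pipeline import process_from
from datalad_next.runners.data_processors import (
    decode_processor,
    splitlines_processor,
)

def chunks():
    # stand-in for the generator returned by the run-context manager;
    # note that a line may be split across chunk boundaries
    yield b'first line\nsec'
    yield b'ond line\n'

for line in process_from(
    data_source=chunks(),
    processors=[
        decode_processor('utf-8'),
        splitlines_processor(separator='\n', keep_ends=False),
    ],
):
    print(line)  # 'first line', then 'second line'
```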

Implementation remark

The run-context manager creates a subclass of the provided protocol class in order to amend it with the kill-logic, and uses this subclass to instantiate a runner. This construction lets the user pass any protocol class without the need to define yet another class, while ensuring that the class actually used by the runner supports the kill-logic.
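A minimal sketch of this construction (only the `KillWrapper` class name, the `dl_` keyword convention, and `_create_kill_wrapper` are mentioned in this PR; everything else here is illustrative):

```python
# Illustrative sketch of the subclass construction described above;
# not the PR's actual _create_kill_wrapper implementation.
def create_kill_wrapper(protocol_class):
    class KillWrapper(protocol_class):
        # `dl_`-prefixed keyword arguments are reserved for wrapper-added
        # parameters (the convention introduced later in this PR)
        def __init__(self, *args, dl_terminate_time=None, dl_kill_time=None,
                     **kwargs):
            self.dl_terminate_time = dl_terminate_time
            self.dl_kill_time = dl_kill_time
            super().__init__(*args, **kwargs)
        # ... terminate-/kill-bookkeeping callbacks would be added here ...
    return KillWrapper

# The runner is then instantiated with the wrapper instead of the
# user-provided protocol class, e.g.:
#   runner = ThreadedRunner(cmd, create_kill_wrapper(UserProtocol), ...)
```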

TODO

  • Improve documentation
  • Use the run-context manager in the relevant source code of datalad-next

Comment on lines 107 to 109
timeout: float | None = None,
terminate_time: int | None = None,
kill_time: int | None = None,
Member commented:

I think we need a precise definition in the documentation.

What is the difference between timeout and terminate_time?

Is kill_time on top of terminate_time?

What is the semantic difference that leads to one being called timeout and the other just time?

@mih (Member) commented Oct 21, 2023

I want to add a few more comments after having tried to exercise this through in a concrete use case.

It is unclear where the timeouts apply, and thereby it is unclear how they must be used.

Example: I have a process that downloads some information. This information is reported in a json-lines format. The process is a batch-mode process of some kind (an input triggers a response, continued until stdin closes).

Now there are several layers a user of this helper needs to be aware of.

  • the layer that feeds the input/request
  • the process that downloads and eventually reports a JSON-line
  • the protocol that assembles a JSON-line and decodes it, once complete

Let's assume the remote bandwidth is really low (a few bytes per second), and the information that needs to go into a single JSON-line is a few kB, so it takes minutes to assemble a response and have the "outer" protocol generator yield something.

What would the timeouts apply to? I assume time0 is defined by the input item being fed to the process. Is it now the inner layer (i.e. nothing coming down the remote connection), or the outer layer (no JSON-line-based result yielded for X seconds) that would timeout?

@christian-monch (Contributor, Author) replied:

If you use a generator-protocol, then the timeouts that aid the terminating and killing mechanism are only triggered when the context is left. I extended the documentation to clarify that. So for `stdout_batchcommand` and `annexjson_batchcommand`, the timeout-counting starts when the context is left, which means when the code in the context decides that all operations are done and exits the context.

Timeout-handling for the transmission should be done in the protocol that is used, e.g. `StdOutCaptureGeneratorProtocol`. For example, the protocol might send timeout-messages to the result generator, and the in-context code can decide what to do. In another example, if the protocol raises an exception in its timeout-callback, the batchcommand-context would be left and the terminate- and kill-timeouts would be started.
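A hypothetical sketch of that second example (the `timeout` callback name is taken from this discussion; its exact signature and semantics are assumptions):

```python
# Hypothetical: a protocol that escalates a transmission timeout into an
# exception. The exception would propagate to the in-context code, the
# context would be left, and the terminate-/kill-timeouts would start.
from datalad_next.runners import StdOutCaptureGeneratorProtocol  # path assumed

class RaisingTimeoutProtocol(StdOutCaptureGeneratorProtocol):
    def timeout(self, fd):
        raise TimeoutError(f'no data received on file descriptor {fd}')
```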

@mih (Member) replied:

I understand the first paragraph and appreciate the clarification.

I have difficulties following the reasoning in the 2nd paragraph. When you say "should be done", are you saying "it is not implemented anywhere"? Do we have any protocol implementation that behaves like this? If not, how do we know that it would work?

christian-monch force-pushed the run-context-manager branch 3 times, most recently from 1f49f8e to 446159e, on October 17, 2023
codecov bot commented Oct 17, 2023

Codecov Report

Attention: 50 lines in your changes are missing coverage. Please review.

Comparison is base (33b8f51) 92.41% compared to head (25bf3cf) 92.55%.
Report is 14 commits behind head on main.

❗ Current head 25bf3cf differs from pull request most recent head c42c588. Consider uploading reports for the commit c42c588 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #484      +/-   ##
==========================================
+ Coverage   92.41%   92.55%   +0.13%     
==========================================
  Files         126      138      +12     
  Lines        9577     9774     +197     
  Branches     1036     1086      +50     
==========================================
+ Hits         8851     9046     +195     
- Misses        704      712       +8     
+ Partials       22       16       -6     
Files Coverage Δ
datalad_next/iter_collections/gitworktree.py 100.00% <100.00%> (ø)
datalad_next/runners/batch.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processor_pipeline.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/__init__.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/decode.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/jsonline.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/pattern.py 100.00% <100.00%> (ø)
datalad_next/runners/run.py 100.00% <100.00%> (ø)
datalad_next/runners/tests/test_batch.py 100.00% <100.00%> (ø)
datalad_next/runners/tests/test_data_processors.py 100.00% <100.00%> (ø)
... and 3 more

... and 10 files with indirect coverage changes


@mih (Member) commented Oct 18, 2023

I posted another round of comments. I am not sure if that is useful; I will stop until I get feedback.

FTR: At 28 commits it has reached my attention span limit. It would be useful for reviewing if the PR were rebased and the commits resembled more reviewable units (new features and their tests, adoption of new features in previously existing code). ATM the commits resemble the development history of the features themselves.

@christian-monch (Contributor, Author) replied:

> I posted another round of comments. I am not sure if that is useful; I will stop until I get feedback.
>
> FTR: At 28 commits it has reached my attention span limit. It would be useful for reviewing if the PR were rebased and the commits resembled more reviewable units (new features and their tests, adoption of new features in previously existing code). ATM the commits resemble the development history of the features themselves.

Will do that as soon as the batch code is settled.

This commit adds a run-context-manager
that can execute subprocesses and
can guarantee that a subprocess is
terminated some time after exiting
the context.

To have a guarantee that the subprocess
is terminated, the user has to set
`kill_time` and preferably also
`terminate_time`.
This commit uses the run-context-manager
and the data processing pipeline with a
pattern_processor in `url_operations/ssh.py`.
This commit uses the data-processing-pipeline
in `datalad_next/runners/tests/test_run.py`.
This commit addresses comment
datalad#484 (comment)
about cluttering the namespace.

The commit introduces a convention
that is currently only used in the
protocol class `KillWrapper`. The
single keyword argument of the
class is prefixed with `dl_`. We
should claim the namespace `dl_*`
for keyword arguments in the
protocols that we define.

This commit also fixes the return
value doc-string description of
`_create_kill_wrapper`.
Comment on lines 257 to 280
with run(
    cmd=[
        'git', 'ls-files',
        # we rely on zero-byte splitting below
        '-z',
        # otherwise take whatever is coming in
        *args,
    ],
    protocol_class=StdOutCaptureGeneratorProtocol,
    stdin=DEVNULL,
    cwd=path
) as r:
    # This code uses the data processor chain to process data. This fixes
    # a problem with the previous version of the code, where `decode` was
    # used on every data chunk that was sent to `pipe_data_received`. But
    # data is chunked up randomly and might be split in the middle of a
    # character encoding, leading to weird errors.
    yield from process_from(
        data_source=r,
        processors=[
            decode_processor('utf-8'),
            splitlines_processor(separator='\0', keep_ends=False)
        ]
    )
@mih (Member) commented:

Suggested change (the block above would become):
with run(
    cmd=[
        'git', 'ls-files',
        # we rely on zero-byte splitting below
        '-z',
        # otherwise take whatever is coming in
        *args,
    ],
    protocol_class=StdOutCaptureGeneratorProtocol,
    processors=[
        decode_processor('utf-8'),
        splitlines_processor(separator='\0', keep_ends=False)
    ],
    cwd=path,
) as r:
    yield from r

I brought it up before, but still think this usage should not be more complex than this. If `run()` must not have `processors=`, it could be renamed and the replacement could have it.

@christian-monch (Contributor, Author) commented Oct 24, 2023

Thanks. I think I misunderstood your earlier comments.

I have a commit which defines a `StdOutCapturePipelineGeneratorProtocol`. The protocol takes an optional processor list as a keyword-argument and sends the processing-pipeline results to the generator. This is transparent to the run-context-manager and achieves the same terse "ls-files"-implementation, i.e. the code in the context is just:

yield from r

I did not yet put it in because it requires the user of `run` to select the "right" protocol and to provide a processor list in the `protocol_kwargs`. That seemed complex to me. And, in contrast to your suggestion (put the pipeline processing into `run`), which should support any protocol, my approach requires support by the protocol. It would be transparent for the run-context-manager code though.

Your suggestion should in principle support any protocol. It would require different treatment of generator- and non-generator protocols if we want to apply pipelines to non-generator protocols as well. The latter would require the results of the non-generator protocols to adhere to a fixed convention. That seems to be the case: all non-generator protocol results are a dictionary with the keys stdout and stderr. If we want to support processor pipelines on non-generator protocols, the run-context manager would have to pass the values of `result_dict['stdout']` and possibly `result_dict['stderr']` through the processing pipeline. This would also require a way to specify processor lists for stdout and stderr. Alternatively, we could limit the processors-keyword validity to generator protocols and stdout only.

I am not sure what would be the best approach here. WDYT?

@christian-monch (Contributor, Author) commented Oct 25, 2023

After having worked on adding processing pipelines to non-generator protocols, I would say: "let's not support processing pipelines in non-generator protocols". The non-generator protocols are very much centered around joining individual byte-chunks in their `pipe_data_received`-method and decoding them into strings in their `_prepare_result`-method. Without overriding the `_prepare_result`-method, it is not even possible to receive undecoded byte-content from a subprocess.

To make use of arbitrary pipelines, `pipe_data_received` and `_prepare_result` have to be overridden. That would introduce another "subtree" of protocol-class definitions. Alternatively, we could introduce a "synchronized" run mode that would collect all results from a generator-protocol-based run and return them. In fact, such an approach could make the tree of non-generator-protocol definitions obsolete: any generator-protocol could be used in a non-generator way.

@christian-monch (Contributor, Author) commented:

I brought the processing protocol classes back. Using those classes, the code that executes `git ls-files` looks like this:

    with run(
            cmd=[
                'git', 'ls-files',
                # we rely on zero-byte splitting below
                '-z',
                # otherwise take whatever is coming in
                *args,
            ],
            protocol_class=StdOutCaptureProcessingGeneratorProtocol,
            stdin=DEVNULL,
            cwd=path,
            protocol_kwargs=dict(
                processors=[
                    decode_processor('utf-8'),
                    splitlines_processor(separator='\0', keep_ends=False)
                ]
            )
    ) as r:
        # This code uses the data processor chain to process data. This fixes
        # a problem with the previous version of the code, where `decode` was
        # used on every data chunk that was sent to `pipe_data_received`. But
        # data is chunked up randomly and might be split in the middle of a
        # character encoding, leading to weird errors.
        yield from r

Comment on lines 1 to 7
""" This module contains data processors for the data pipeline processor

The data processors contained here are:

- decode_utf8_processor

"""
@mih (Member) commented:

None of this documentation is accessible in the rendered docs. This hides the fact that all the data_processors modules are documented in the exact same way, making them indistinguishable from each other.

I think the pattern of manually listing the main components of a module in the top-level docstring is not optimal. In the rendered documentation this would be automatic/obvious already. Here it adds content that needs to be kept in sync. I believe an `__all__` declaration would be a more appropriate alternative.

In general I think that the sorting of code in all the modules artificially creates the need to describe what matters most. The "public" component tends to be last in the file (although it could be first).

@christian-monch (Contributor, Author) replied:

> [...]
> In general I think that the sorting of code in all the modules artificially creates the need to describe what matters most. The "public" component tends to be last in the file (although it could be first).

An old C/C++ habit that is seemingly difficult to shake. But I will manage :-)

This commit adds pipeline processing generator
protocols. The protocols can be initialized
with a list of processors that they will apply
to the data that they receive.

There is currently no equivalent for
non-generator protocols. The reason for that
is that those protocols expect every data
element that they handle to be `bytes`.
That does not fit well with processors
like `splitlines_processor` or
`decode_processor`.

We could implement
a new non-generator protocol "family"
that collects whatever comes from a
processor pipeline as the values of
`stdout`- and `stderr`-keys of the
result dictionary. But it is currently
not clear when and where that would
be used.
@mih (Member) commented Oct 25, 2023

I want to take one aspect from above and bring it up for a dedicated discussion:

> I would say: "let's not support processing pipelines in non-generator protocols"

What about "let's not support non-generator protocols" instead?

This PR adds 2k+ lines already. This is ~10% of the entire code base. It is a LOT. I get the impression that some of that is just due to "need to support old stuff that happened to be done". But this is not what this extension is about.

If we can find a paradigm that can be supported with a lean implementation that does not multiply the code base, I would very much prefer that. I do not mind if that has consequences, as long as they are spelled out.

What would happen if only generator-protocols were supported? What would be the performance impact on real use cases (I am specifically not talking about theory -- I understand that thread-management adds complexity)?

@christian-monch (Contributor, Author) replied:

> I want to take one aspect from above and bring it up for a dedicated discussion:
>
> > I would say: "let's not support processing pipelines in non-generator protocols"
>
> What about "let's not support non-generator protocols" instead?

That is a good question. I think it is a very interesting idea (I had a similar hunch when I wrote the developer handbook). Off the top of my head, a probably incomplete initial answer:

> This PR adds 2k+ lines already. This is ~10% of the entire code base. It is a LOT. I get the impression that some of that is just due to "need to support old stuff that happened to be done". But this is not what this extension is about.
>
> If we can find a paradigm that can be supported with a lean implementation that does not multiply the code base, I would very much prefer that. I do not mind if that has consequences, as long as they are spelled out.
> [...]

If we only support generator-protocols, we might get away with exactly one protocol that just forwards all events to a result generator. Check the GenericGeneratorProtocol example in the "Runner with Generators Protocols - Developer handbook" (https://datalad--476.org.readthedocs.build/projects/next/en/476/developer_guide/generator-runner.html) for an example. That approach moves almost all processing out of protocol callbacks into the code that uses the runner.

I think it is conceptually simpler because there are no more "opaque" operations performed in protocol classes. The drawback is that it would require new abstractions to write nice, correct, and re-usable code that processes the events in the `for event in event_source:` loop. The data processing pipeline is one concept that could be useful here. So the code base would probably not be reduced in total.
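A hypothetical illustration of such an event loop (the `(source, payload)` event shape is an assumption for illustration, not the interface from the linked handbook):

```python
# Hypothetical consumer of a generic event-forwarding generator protocol.
def collect_stdout(event_source):
    chunks = []
    for source, payload in event_source:  # event shape is assumed
        if source == 'stdout':
            chunks.append(payload)
        elif source == 'timeout':
            break  # stop waiting for the subprocess
    return chunks
```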

The sweet spot is probably somewhere in the middle. The `*ProcessingGeneratorProtocol` classes might offer a combination of abilities and flexibility that could a) reduce the overall number of classes and b) support a large range of usage scenarios.

> [...] What would happen, if only generator-protocols are supported? [...]

Within datalad-next:
  1. A reduced number of protocol classes.
  2. Possibly shorter protocol class names due to the removal of the Generator-part.
  3. If we still want to support a synchronous mode, i.e. a mode that behaves as if there were a non-generator-protocol, we could use a wrapper around asynchronous runs; see the sketch after this list. The wrapper would collect outputs from the generator until the process terminates and return those outputs together with the exit code to its caller. (A generic wrapper would either require that all generator-protocols signal data sources, i.e. stdout or stderr, in the same way, or that a provided function is capable of determining the data source.)
Within datalad:
  1. A number of runner invocations would have to be changed, to use a wrapper function as described above.
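A minimal sketch of the wrapper described in item 3 above (the function name and the `classify` callable are hypothetical; the `return_code` attribute on the exhausted generator is an assumption):

```python
# Hypothetical synchronous wrapper around a generator-protocol run.
def run_sync(result_generator, classify):
    """Collect all outputs from `result_generator` and return them
    together with the exit code.

    `classify(item)` must map a yielded item to 'stdout' or 'stderr'.
    """
    collected = {'stdout': [], 'stderr': []}
    for item in result_generator:
        collected[classify(item)].append(item)
    # exhausting the generator guarantees that the subprocess has exited;
    # assume the generator exposes the return code afterwards
    return collected, result_generator.return_code
```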
Overview of protocols that are currently used in datalad-next:

datalad-next currently uses the following protocols in its application code, i.e. not in the tests:

  • NoCaptureGeneratorProtocol
  • StdOutCaptureGeneratorProtocol
  • StdOutCaptureProcessingGeneratorProtocol
  • GeneratorAnnexJsonProtocol (from datalad-core)

The tests use the following protocols in addition:

  • StdOutErrCaptureProcessingGeneratorProtocol
  • NoCapture (from datalad-core)
  • StdOutCapture (from datalad-core)
  • StdErrCapture (from datalad-core)
  • StdOutErrCapture (from datalad-core)

> What would be the performance impact on real use cases (I am specifically not talking about theory -- I understand that thread-management adds complexity)?

The performance impact should be negligible because the non-generator protocols and the generator protocols use the same thread-based implementation. Compared to non-generator protocols the generator protocols add the overhead of:

  1. sending results to the result generator queue
  2. going through the result generator state machine, when the next result is fetched

which does not create a lot of performance overhead.

As a remark: if performance becomes a problem, we could look into a `select`- or `poll`-based runner on POSIX, which would expose the same interface as `ThreadedRunner` and be a drop-in replacement that could automatically be used in POSIX environments.

christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Oct 26, 2023
This commit addresses comment:
datalad#484 (comment)
It removes unrendered doc strings and
reorders data processor code so that
the data processor is on top of the
source files. In addition, it adds
`__all__` variables to limit imports
to the objects that constitute the
user-interface.

The commit also adds doc-string
rendering for the module
`datalad_next.runners.data_processors`
This includes markup fixes
mih mentioned this pull request Oct 26, 2023
@mih (Member) commented Oct 26, 2023

Registering #497 (comment) here.

I did not expect that, but I have to admit that I lost track a bit.

@christian-monch (Contributor, Author) commented:

This PR is superseded by PR #516. I will close it for now.

christian-monch deleted the run-context-manager branch on July 16, 2024